Red Wine Quality Analysis by Gangadhara Naga Sai

## [1] "Names of variables "
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## [1] "Dimensions of wine data"
## [1] 1599   12
## [1] "Structure of wine data"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  15.9 15.6 15.6 15.5 15.5 15 15 14.3 14 13.8 ...
##  $ volatile.acidity    : num  0.36 0.685 0.645 0.645 0.645 0.21 0.21 0.31 0.41 0.49 ...
##  $ citric.acid         : num  0.65 0.76 0.49 0.49 0.49 0.44 0.44 0.74 0.63 0.67 ...
##  $ residual.sugar      : num  7.5 3.7 4.2 4.2 4.2 2.2 2.2 1.8 3.8 3 ...
##  $ chlorides           : num  0.096 0.1 0.095 0.095 0.095 0.075 0.075 0.075 0.089 0.093 ...
##  $ free.sulfur.dioxide : num  22 6 10 10 10 10 10 6 6 6 ...
##  $ total.sulfur.dioxide: num  71 43 23 23 23 24 24 15 47 15 ...
##  $ density             : num  0.998 1.003 1.003 1.003 1.003 ...
##  $ pH                  : num  2.98 2.95 2.92 2.92 2.92 3.07 3.07 2.86 3.01 3.02 ...
##  $ sulphates           : num  0.84 0.68 0.74 0.74 0.74 0.84 0.84 0.79 0.81 0.93 ...
##  $ alcohol             : num  14.9 11.2 11.1 11.1 11.1 9.2 9.2 8.4 10.8 12 ...
##  $ quality             : int  5 7 5 5 5 7 7 6 6 6 ...
## [1] "Summary of Redwine data"
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Univariate Plots Section

Quality is scored from 0 to 10, but in this data the minimum is 3 and the maximum is 8, with a median of 6, which means that most of the wines we will look at in the analysis are average wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Univariate Analysis

What is the structure of your dataset?

The dataset describes the quality of red wines and several chemical components that they contain. There are 1599 samples of wine with 12 variables: 11 chemical measurements of type numeric (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and 1 rating variable, quality, of type int.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest. It is a rating given by 3 wine experts according to their knowledge and experience. Quality ranges from 0 to 10, but in our data the lowest quality is 3 and the highest is 8. Let's find out which factors matter most for high quality wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Many more features could help, since in the real world many factors affect the quality of red wine.

Type of grapes used

Flavor (such as the combination of different ingredients)

Color

Taste (sweet, sour, bitter, etc.)

Total cost from the ingredients to final production of the wine (cost matters, since a high quality wine at a low cost is especially valuable)

Did you create any new variables from existing variables in the dataset?

Yes, I created total.acidity and combined.sulphur.dioxide, which may show trends that the individual variables do not.
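The report does not say how these two variables were computed. The sketch below is one plausible construction, not the author's confirmed method: it assumes total.acidity is the sum of fixed and volatile acidity, and combined.sulphur.dioxide is total minus free sulfur dioxide. The input values are the first ten rows printed by the structure output above.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  fixed.acidity        = c(15.9, 15.6, 15.6, 15.5, 15.5, 15, 15, 14.3, 14, 13.8),
  volatile.acidity     = c(0.36, 0.685, 0.645, 0.645, 0.645, 0.21, 0.21, 0.31, 0.41, 0.49),
  free.sulfur.dioxide  = c(22, 6, 10, 10, 10, 10, 10, 6, 6, 6),
  total.sulfur.dioxide = c(71, 43, 23, 23, 23, 24, 24, 15, 47, 15)
)

# ASSUMPTION: total acidity is fixed acidity plus volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# ASSUMPTION: combined (bound) sulphur dioxide is total minus free.
wine$combined.sulphur.dioxide <-
  wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

print(head(wine[, c("total.acidity", "combined.sulphur.dioxide")], 3))
```

If the original definitions differ (for example, if citric acid was included in total.acidity), only the two assignment lines need to change.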

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Volatile acidity has a bimodal distribution, and citric acid has a long-tailed distribution that is not normal. The data was already tidy, so no adjustment was required.

Bivariate Plots Section

Correlation coefficients give a first look at how each variable relates to quality.

## [1] "Correlation among the variables"
##     volatile.acidity total.sulfur.dioxide              density 
##          -0.39055778          -0.18510029          -0.17491923 
##            chlorides                   pH  free.sulfur.dioxide 
##          -0.12890656          -0.05773139          -0.05065606 
##       residual.sugar        fixed.acidity          citric.acid 
##           0.01373164           0.12405165           0.22637251 
##            sulphates              alcohol              quality 
##           0.25139708           0.47616632           1.00000000

Alcohol (0.476) and volatile acidity (-0.391) have the strongest correlations with the quality of wine. Sulphates (0.251) and citric acid (0.226) are also positively correlated with quality, though more weakly. Residual sugar has almost no correlation with quality (0.014).
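A table like the one above can be produced with base R's `cor()`. The following is a minimal sketch, assuming a data frame named `wine` (a hypothetical name) with numeric columns. To stay self-contained it uses only the ten rows shown in the structure output, so its coefficients will not match the full-data values.

```r
# First ten rows of five columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.36, 0.685, 0.645, 0.645, 0.645, 0.21, 0.21, 0.31, 0.41, 0.49),
  residual.sugar   = c(7.5, 3.7, 4.2, 4.2, 4.2, 2.2, 2.2, 1.8, 3.8, 3),
  sulphates        = c(0.84, 0.68, 0.74, 0.74, 0.74, 0.84, 0.84, 0.79, 0.81, 0.93),
  alcohol          = c(14.9, 11.2, 11.1, 11.1, 11.1, 9.2, 9.2, 8.4, 10.8, 12),
  quality          = c(5L, 7L, 5L, 5L, 5L, 7L, 7L, 6L, 6L, 6L)
)

# Pearson correlation of every column with quality, sorted from the most
# negative to the most positive. quality itself comes last with r = 1.
quality_cor <- sort(cor(wine)[, "quality"])
print(round(quality_cor, 3))
```

Sorting the coefficients puts the strongest negative and positive relationships at the two ends of the vector, which is the layout of the output above.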

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

With quality as the feature of interest, the correlation between quality and each other variable in the dataset was examined. Quality is positively correlated with alcohol content and negatively correlated with volatile acidity, density, total sulfur dioxide and chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

pH and volatile acidity are positively correlated. This is unexpected, because a higher pH value means less acidity, while in the plots a higher volatile acidity means more acidity. The density of wine has a high negative correlation with the amount of alcohol in it. I was expecting a close relation between sulphates and sulphur dioxide, but there seems to be none, with a correlation coefficient of 0.04.
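Relationships between the other features can be ranked by flattening the correlation matrix into one row per pair. This is a sketch on the ten rows printed in the structure output, with `wine` as a hypothetical data frame name, so the ranking it prints is illustrative and does not reproduce the full-data coefficients quoted above.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity     = c(0.36, 0.685, 0.645, 0.645, 0.645, 0.21, 0.21, 0.31, 0.41, 0.49),
  total.sulfur.dioxide = c(71, 43, 23, 23, 23, 24, 24, 15, 47, 15),
  pH                   = c(2.98, 2.95, 2.92, 2.92, 2.92, 3.07, 3.07, 2.86, 3.01, 3.02),
  sulphates            = c(0.84, 0.68, 0.74, 0.74, 0.74, 0.84, 0.84, 0.79, 0.81, 0.93)
)

m <- cor(wine)

# Blank out the diagonal and the upper triangle so each pair appears once.
m[upper.tri(m, diag = TRUE)] <- NA

# One row per remaining pair, ordered by absolute correlation.
pairs <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[!is.na(pairs$r), ]
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```

Ordering by the absolute value keeps strong negative relationships, such as the one between density and alcohol, next to strong positive ones.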

What was the strongest relationship you found?

The strongest relationships are the correlations of quality with the other variables:

## [1] "Correlation among the variables with quality"
##     volatile.acidity total.sulfur.dioxide              density 
##          -0.39055778          -0.18510029          -0.17491923 
##            chlorides                   pH  free.sulfur.dioxide 
##          -0.12890656          -0.05773139          -0.05065606 
##       residual.sugar        fixed.acidity          citric.acid 
##           0.01373164           0.12405165           0.22637251 
##            sulphates              alcohol              quality 
##           0.25139708           0.47616632           1.00000000

From the correlations we can see that alcohol (positively, 0.476) and volatile.acidity (negatively, -0.391) have the strongest relationships with quality. Density and fixed acidity are also strongly correlated with each other.

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol and volatile acidity have the strongest correlations with quality: less volatile acidity and more alcohol give better wine. Sulphates in the range 0.5 to 1.5 and chlorides in the range 0.1 to 1.5 give a high quality wine. This suggests that there is an optimal range for the amounts of these two components to make the best wine.
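One way to look for such a range is to compute the minimum and maximum of a component within each quality score. The sketch below does this for sulphates using the ten rows printed in the structure output. The name `wine` is hypothetical, and ten rows are far too few to reproduce the ranges quoted above.

```r
# First ten rows of two columns, copied from the str() output above.
wine <- data.frame(
  sulphates = c(0.84, 0.68, 0.74, 0.74, 0.74, 0.84, 0.84, 0.79, 0.81, 0.93),
  quality   = c(5L, 7L, 5L, 5L, 5L, 7L, 7L, 6L, 6L, 6L)
)

# Split sulphates by quality score and take the range of each group.
by_quality <- split(wine$sulphates, wine$quality)
sulphate_range <- t(sapply(by_quality, range))
colnames(sulphate_range) <- c("min", "max")

# One row per quality score, with the observed min and max of sulphates.
print(sulphate_range)
```

The same pattern applies to chlorides or any other column by changing the first argument of `split()`.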

Citric acid and volatile acidity did not give any useful results.

Were there any interesting or surprising interactions between features?

There is a strong correlation between fixed.acidity and density. The reason is unknown; it may depend on properties of the wine that this data does not explain.

There is almost no correlation between residual sugar and the quality of wine (0.014).

I was expecting a close relation between sulphates and sulphur dioxide, but there seems to be none, with a correlation coefficient of 0.04.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Description One

Quality is scored from 0 to 10, but in this data the minimum is 3 and the maximum is 8, which means that most of the wines in the analysis are average wines. Wines rated 5 or 6 make up about 80% of the data, while wines rated 7 or 8 contribute only a little over 10%. Because there is so little data on high quality wine, it is hard to determine which composition leads to high quality wine.
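The shares of each quality score can be checked with `table()` and `prop.table()`. This report does not print the counts per score, so the counts below are an assumption taken from the public UCI red wine data. They do reproduce the 1599 rows and the mean of 5.636 shown in the summary.

```r
# ASSUMPTION: counts per quality score, taken from the public UCI red wine
# data. They are not printed anywhere in this report.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality column from the counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Share of wines at each quality score.
share <- prop.table(table(quality))
print(round(share, 3))

# Combined shares of average (5 or 6) and high (7 or 8) quality wines.
average_share <- sum(share[c("5", "6")])
high_share <- sum(share[c("7", "8")])
print(round(c(average = average_share, high = high_share), 3))
```

Under these counts, scores 5 and 6 together cover roughly 82% of the wines and scores 7 and 8 roughly 14%.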

Plot Two

Description Two

Alcohol is the variable most correlated with quality (0.476): wines with a higher alcohol content tend to be of higher quality. Density, in turn, is the variable most negatively correlated with alcohol. The quality of wine therefore tends to be higher when there is more alcohol and a lower density.

Plot Three

Description Three

The plots above all show the same variables. The smoothed scatter plot shows that the density of wine decreases as the alcohol present in it increases. There are a few concentrations of quality 8 wines in the scatter plot where alcohol content is high and density is low.
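The direction of the density and alcohol relationship can be checked with a simple linear fit. The structure output above prints only five rounded density values, so this sketch fits a line to just those five rows. It illustrates the call and is not a reproduction of the plot. The name `wine` is hypothetical.

```r
# Only the first five density values are printed in the str() output above,
# so this illustration is limited to those five rows (rounded values).
wine <- data.frame(
  alcohol = c(14.9, 11.2, 11.1, 11.1, 11.1),
  density = c(0.998, 1.003, 1.003, 1.003, 1.003)
)

# Linear model of density as a function of alcohol.
fit <- lm(density ~ alcohol, data = wine)
slope <- coef(fit)[["alcohol"]]

# A negative slope means density falls as alcohol rises.
print(slope)
print(cor(wine$alcohol, wine$density))
```

On the full data the same call would use all 1599 rows, and the sign of the slope is what matters for the interpretation above.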


Reflection

Starting from the histograms and box plots, we could clearly see the distribution of the points. Some were normal distributions and a few were bimodal distributions.

In the bivariate analysis, the combination of box plots and scatter plots gave a clear idea of how the variables are correlated with each other, and mainly with our feature of interest, quality.

The multivariate analysis went a step further than the bivariate analysis to give insight into the quality of wine. We understood that a specific range of a chemical component can give a high quality wine. Sulphates in the range 0.5 to 1.5 and chlorides in the range 0.1 to 1.5 give a high quality wine. This suggests that there is an optimal range for the amounts of these two components to make the best wine.

As the data was already tidy, there was no need for data wrangling.

For future analysis, I would like to collect data on the features listed below, since in the real world many factors affect the quality of red wine.

Type of grapes used

Flavor (such as the combination of different ingredients)

Color

Taste (sweet, sour, bitter, etc.)

Total cost from the ingredients to final production of the wine (cost matters, since a high quality wine at a low cost is especially valuable)

Does the cost of wine depend on its quality? If so, can we find a way to make better wine at a lower cost by understanding its relation to all of these real-world factors?


